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Abstract 

A novel algorithm for wide-baseline matching called MODS - Matching On 
H^dmand with view Synthesis - is presented. The MODS algorithm is experimen- 
^t^ly shown to solve a broader range of wide-baseline problems than the state 
o£dhe art while being nearly as fast as standard matchers on simple problems. 
O^e apparent robustness vs. speed trade-off is finessed by the use of progres- 
T ST^ely more time-consuming feature detectors and by on-demand generation of 
athesized images that is performed until a reliable estimate of geometry is 
Stained. 

We introduce an improved method for tentative correspondence selection, 
dicable both with and without view synthesis. A modification of the standard 
st to second nearest distance rule increases the number of correct matches by 
£0% at no additional computational cost. 

Performance of the MODS algorithm is evaluated on several standard pub- 
• fiely available datasets, and on a new set of geometrically challenging wide base- 
pft^e problems that is made public together with the ground truth. Experiments 
£h£>w that the MODS outperforms the state-of-the-art in robustness and speed. 
Moreover, MODS performs well on other classes of difficult two-view problems 
like matching of images from different modalities, with wide temporal baseline 
or with significant lighting changes. 

Keywords: wide baseline stereo; image matching; local feature detectors, local 
feature descriptors 


1. Introduction 

The wide baseline stereo PQ problem - the automatic estimation of a geo¬ 
metric transformation and the selection of consistent correspondences between 
view pairs separated by a wide baseline - has received significant attention in 
the last 15 years State-of-art local feature detectors a, 0 , 0,0 and 

descriptors m, 0,0 allow to match images of a scene with a viewing angle 
difference up to 60° for planar objects [9 and 30° for non-planar 3D objects 
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uni. Fast detectors m, m and binary descriptors m, m, ng make match¬ 
ing significantly faster at the cost of decreasing tolerance to scale, rotation and 
affine changes. At the other end of the spectrum of wide baseline problems, the 
ASIFT matching scheme mm, increased the range of handled viewing angle 
differences up to the 80° at the cost of a significant slow-down. 

We propose a novel two-view matching algorithm called MODS - matching 
with on-demand view synthesis - that handles viewing angle difference even 
larger than the state-of-the-art ASIFT algorithm, without a significant increase 
of computational costs over “standard” wide and narrow baseline approaches. 
The performance gain is achieved by introducing a number of improvements to 
the wide-baseline matching process. 

First, MODS employs a combination of different detectors. It is known that 
different detectors are suitable for different types of images [9] and that some 
detectors are complementary in the type of structures in the image they respond 
to [18]. Moreover, we show that the combination of the different detectors allows 
increasing the average speed of the matching and to match pairs of images 
which can not be solved by any of the detectors alone. The results indicate that 
searching for the “best” detector leads to data- and problem-specific outcomes 
and that exploiting multiple detectors is superior to any single one. 

Second, we introduce an iterative scheme which follows the “do only as 
much as needed” principle. Progressively more powerful yet slower detectors 
and descriptors are applied, together with more images synthesized on-demand, 
until sufficient support for a two-view geometry estimate is obtained. Such on 
demand approach finesses an apparent robustness vs. speed trade-off, avoiding 
the slowdown for easy wide baseline problems brought by the time-consuming 
operations needed for solving the most challenging pairs. 

Third, a novel tentative correspondences generation strategy is presented 
which generalizes the standard first to second closest distance ratio [6]. The 
selection strategy which shows performance superior to the standard method is 
applicable to any vector descriptor like SIFT [6-, LIOP [19] and MROGH [20] . 

The parameters of the MODS algorithm were optimized and its performance 
thoroughly evaluated. The optimization included the selection of the particular 
sequence of feature detectors, the choice of the number and parameters of images 
synthesized to facilitate matching and the parameter setting of the individual 
detectors. 

The performance of the MODS algorithm was validated on several publicly 
available datasets and it was compared to the state-of-the-art in both speed and 
robustness, i.e. the ability to recover the two-view geometry reliably. We show 
that MODS significantly outperforms prior approaches in both robustness and 
speed. We have collected a set of image pairs for evaluating MODS on wide 
baseline problems with very large angular difference between views. These form 
the Extreme View Dataset. The dataset with the ground truth and the source 
code of the MODS algorithm is available on the authors web-pag^J 


1 Available at http://cmp.felk.cvut.cz/wbs/index.html 
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Algorithm 1 The standard two view matching scheme 
Input: ii, I2 - two images. 

Output: Fundamental or homography matrix F or H respectively; 
a list of corresponding features. 


for 1 1 and I2 independently do 

1.1 Detect and describe local features, 
end for 

2 Generate tentative correspondences for Ii and I2 using the 2 nd closest ratio. 

3 Geometrically verify tentative correspondences using RANSAC 
while estimating H or F. 


2. Related work 


The standard wide baseline matching pipeline (see Alg. [T]) begins with the 
detection of local features, computation of descriptors, generation of tentative 
correspondences and ends with geometric verification using the homography or 
epipolar constraint. Matching images of a scene with viewpoint difference up to 
60° for planar objects [2] and 30° for non-planar 3D objects m was reported. 
The execution time varies from a fraction of a second to seconds for 800x600 
images [9]. 

The idea of generating synthetic views to improve a local feature based wide 
baseline matching pipeline was first explored by Lepetit and Fua [211 . They syn¬ 
thesized views to find distinctive keypoints repeatedly detectable under affine 
deformations. Synthetic views provided a training set for learning a random for¬ 
est classifier that labeled individual feature points. Feature points in different 
images with the same label were assumed to be in correspondence. The simple 
keypoint detector of Lepetit and Fua is very fast, but invariant only to transla¬ 
tion and rotation and thus the number of views necessary to achieve acceptable 
repeatability was high. The method was tested on pairs undergoing significant 
affine transformations, but the final representation did not scale and can not be 
easily used for indexing. 

Recently, Morel et.al. [16] proposed a new matching pipeline - see Alg. [ 2 J 
The authors showed that view synthesis extends the handled range of view¬ 
point differences. The ASIFT algorithm starts by generating synthetic views 


(described in Section 3.1) for both images. Next, feature detection and descrip¬ 


tion are performed using standard SIFT [6| in each synthesized view. Tentative 
correspondences are formed for all pairs of views synthesized from the first 
and second image. The matching stage thus entails n 2 independent matching 
problems, where n is the number of synthesized views per image. The set of 
correspondences between the images is the union of results for all synthesized 
pairs. The duplicate filtering stage of ASIFT prunes correspondences with small 
spatial distance (2 pixels) of local features in both images - all such correspon¬ 
dences except one (random) are eliminated from the final correspondence set. 
“ One-to-many” correspondences - correspondences of features which are close 
to each other (are situated in radius of y/2 pixels) in one image while spread 
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Algorithm 2 ASIFT 
Input: ii, I2 - two images. 

Output: List of corresponding points; fundamental matrix F. 


for 1 1 and I2 independently do 

1.1 Generate synthetic views according to the tilt-rotation-detector setup. 

1.2 Detect and describe local features, 
end for 

2 Generate tentative correspondences for each pair of the synthesized views 
of 

/1 and I2 independently using the 2 nd closest ratio. 

3 Add correspondences to the general list. 

Reproject corresponding features to original images. 

4 Filter duplicate, “one-to-many” and “many-to-one” matches. 

5 Geometrically verify tentative correspondences using ORSA m 
while estimating F. 


in other synthetic views are also eliminated, despite the fact that some of the 
pairs can be correct. Finally, geometric verification is performed by ORSA [22]. 
ORSA is a RANSAC-based method, which exploits an a-contrario approach to 
detect incorrect epipolar geometries. Instead of having a constant error thresh¬ 
old, ORSA looks for matches that have the highest “diameter”, i.e. matches 
which cover a large image area. ASIFT was shown to match images of a scene 
with viewpoint difference up to 80° for planar objects m • Computational costs 
are in the order of tens of seconds to a few minutes. 

The latest extensions of wide-baseline matching pipeline are limited to mod¬ 
ifications of the ASIFT algorithm. Liu et.al. m synthesized perspective warps 
rather than affine. Pang et.al [24] replaced SIFT by SURF [71 in the ASIFT 
algorithm to reduce the computation time. Forssen and Lowe [25] proposed to 
detect MSER on scale pyramid, which might be seen as scale synthesis. 

3. The MODS algorithm 

The main idea of the proposed iterative MODS algorithm (see Alg. |3| is to 
repeat a sequence of two-view matching procedures, until a required number 
of geometrically verified correspondences is found. In each iteration, a differ¬ 
ent and potentially complementary detector is used and a different set of views 
synthesized. The algorithm starts with fast detectors with limited invariance 
proceeding progressively with more complex, robust, but computationally costly 
ones. MODS is thus capable of solving simple matching problem fast without 
loosing the ability to deal with very difficult cases where a combination of de¬ 
tectors is employed to extend the state-of-the-art. 

The adopted sequence of detectors and view synthesis parameters is an out¬ 
come of extensive experimental search. The objective was to solve the most 
challenging problems in the development set, i.e. to correctly recover their two- 
view geometry, while keeping the speed comparable to standard single-detector 
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Algorithm 3 MODS 

Input: ii, I2 - two images; - minimum required number of matches; 

'S'max - maximum number of iterations. 

Output: Fundamental or homography matrix F or H; list of corresponding points. 
Variables: iV mat ches - detected correspondences, Iter - current iteration. 


while (iVmatches < #m) and (Iter < ^max ) do 

for ii and I2 independently do 

1.1 Generate synthetic views according to the 
scale-tilt-rotation-detector-descriptor setup for the Iter (Tables [l] [ 4 J. 

1.2 Detect and describe local features. 

1.3 Reproject local features to original image. 

Add described features to general list. 

end for 

2 Generate tentative correspondences using the first geom. inconsistent 
rule. 

3 Filter duplicate matches. 

4 Geometrically verify tentative correspondences with DEGENSAC [ 26 ] 
while estimating F or H. 

5 Geometrically verify inliers with local affine frame shape. 

end while 


wide-baseline matchers for simple problems. Details about the selected config¬ 
uration and the optimization process are given in Section [4] The rest of the 
section describes the steps involved in the iterations of the MODS algorithm, 
which is compared to the standard two view matching and ASIFT pipelines. 

3.1. Synthetic views generation 

MODS (Alg. [ 3 ]) starts by synthetic view generation. It is well known that a 
homography H can be approximated by an affine transformation A at a point us¬ 
ing the first order Taylor expansion. The affine transformation can be uniquely 
decomposed by SVD into a rotation, skew, scale and rotation around the optical 
axis [27] . In [16], the authors proposed to decompose the affine transformation 
A as 

A = HxR^TtR^) = 

_ ^ /cos-0 — sim/A/t 0\ /cos (f) —sin<A (1) 

\sinT/ cost/ J \0 lj \sin</ cos</ ) 

where A > 0, R\ and R 2 are rotations, and T t is a diagonal matrix with t > 1. 
Parameter t is called the absolute tilt, G (0, tt) is the optical axis longitude and 
^ E (0, 27r) is the rotation of the camera around the optical axis (see Figure [l]). 
Each synthesized view is thus parametrized by the tilt, longitude and optionally 
the scale and represents a sample of the view-sphere resp. view-volume around 
the original image. 

The view synthesis proceeds in the following steps: at first, a scale synthesis 
is performed by building a Gaussian scale-space with Gaussian a = abase * S 
and downsampling factor S (S < 1). Then, each image in the scale-space is 
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Figure 1: (left) the affine camera model 0 - Latitude 6 = arccos 1/t — latitude, longitude </>, 
scale A scale, (right) Transitional tilt r for absolute tilt t and rotation cf). 


in-plane rotated by longitude (\> with step A</> = A^bas e/t. In the third step, all 
rotated images are convolved with a Gaussian filter with a = abase along the 
vertical and a = t • abase along the horizontal direction to eliminate aliasing in 
a final tilting step. The final tilt is applied by shrinking the image along the 
horizontal direction by factor t. The synthesis parameters are: the set of scales 
{S}, A^base ~ the longitude sampling step at tilt t = 1 , the set of simulated 
tilts {t}. 

3.2. Local feature detection and description 

The second step of MODS is detection and description of local features. It 
is known that different local feature detectors are suitable for different types of 
images [9] and that some detectors are complementary in the image structures 
they respond to m- Our experiments show (see Section [5]) that combining 
detectors improves the overall robustness and speed of the matching procedure. 

MODS combines a fast similarity covariant FAST (in ORB implementation) 
detector and affine covariant detectors MSER and Hessian-Affine. The nor¬ 
malized patches are described by the binary descriptor BRIEF [13| (in ORB 
implementation) and a recent modification of SIFT [6] - the RootSIFT [28] . 
The local feature frames computed on the synthesized views are backprojected 
to the coordinate system of the original image by the known affine matrix A 
and associated with the descriptor and the originating synthetic view. MODS 
steps configuration are specified in Table [l] 

For the MSER and Hessian-Affine detectors, the fast affine feature extraction 
process from [29] was applied. 

3.3. Tentative correspondence generation 

The next step of the MODS algorithm is the generation of tentative cor¬ 
respondences. Different strategies for the computation of tentative correspon¬ 
dences in wide-baseline matching were proposed. The standard method for 
matching SIFT (-like) descriptors is based on ratio of the distances to the closest 
and the second closest descriptors in the other image |6J. While the performance 
of this test is in general very good, it degrades when multiple observations of the 
same feature are present. In this case, the presence of similar descriptors will 
lead to the first to second SIFT ratio to be close to 1 and the correspondences 
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Table 1: MODS step configurations are defined by a detector, descriptor and the set of the 
synthesized views. RootSIFT is used for all detectors but ORB which is described by BRIEF. 


Iter. 

Setup 

1 

ORB,{S} = {1}, {1} = {1}, A <j> = 360 °/t 

2 

ORB,{S'} = {1}, {t} = {1; 5; 9}, A 4> = 360 °/t 

3 

MSER,{5} = {1; 0.25; 0.125}, {1} = {1}, A0 = 360 °/t 

4 

MSER,{5} = {1; 0.25; 0.125}, {t} = {1; 3; 6; 9}, A0 = 360 °/t 

5 

HessAff, {S} = {1}, {t} = (1; 2; 4; 6; 8}, A 4> = 360 °/t 

6 

HessAff, {S} = {1}, {*} = (1; 2; 4; 6; 8}, A 4> = 120 °/t 

7 

HessAff , {S} = {1}, {1} = (1; 2; 4; 6; 8; 10}, A <p = 60° /t 


Table 2: MODS variants tested. The corresponding steps are the same as in Table [T] The 
tarred MODS:3* -6* configuration slightly differs from Table [l| and correspon ds to [5o[7 


Name 

Steps 

MODS == MODS:l-7 

1 

2 

3 

4 

5 

6 

7 

MODS:2-7 

- 

2 

3 

4 

5 

6 

7 

MODS:3*-6 

- 

- 

3* 

4* 

5 

6 

- 

MODS:3-7 

- 

- 

3 

4 

5 

6 

7 

MODS:3,2-7 

3 

- 

2 

4 

5 

6 

7 

MODS:2,4,7 

- 

2 

- 

4 

- 

- 

7 

MODS:5,l-7 

5 

1 

2 

3 

4 

6 

7 


will ” annihilate” each other, despite the fact they represent the same geometric 
constraints and are therefore not mutually contradictory (see Figure [ 5 ]). The 
problem of multiple detections is amplified in matching by view synthesis since 
covariantly detected local features are often repeatedly discovered in multiple 
synthetic views. 

To address this problem, we propose a modified matching strategy denoted 
first to first geometrically inconsistent - FGINN . Instead of comparing the first 
to the second closest descriptor distance, the distance of the first descriptor and 
the closest descriptor that is geometrically inconsistent with the first one is used. 
We call descriptors in one image geometrically inconsistent if the Euclidean 
distance between centers of the regions is > n pixels (default: n = 10). The 
difference of the first-to-second closest ratio strategy and the FGINN strategy 
is illustrated in Figure [2] 

3.4- Geometric verification 

The last step of the MODS is the geometric verification. It consists of three 
substeps. 

3.4- 1. Duplicate filtering 

The redetection of covariant features in synthetic views results in duplicates 
in tentative correspondences. The duplicate filtering prunes correspondences 
with close spatial distance (« 5 pixels) of local features in both images - all 
these correspondences except one - with smallest descriptor distance ratio - are 
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Figure 2: Green regions - correct correspondences rejected by the standard first to second 
closest ratio test (second closest region is in red), but recovered by the first to first geometri¬ 
cally inconsistent ratio (first geometrically inconsistent region is in yellow) matching strategy. 
DoG regions (top), MSERs (bottom). 


eliminated from the final correspondences list. The number of pruned corre¬ 
spondences can be used later for evaluating the quality (probability of being 
correct) in PROSAC-like [31] geometric verification. 

3.4.2. RANSAC 

The LO-RANSAC [32] algorithm searches for the maximal set of geometri¬ 
cally consistent tentative correspondences. The model of the transformation is 
set either to homography or epipolar geometry, or automatically determined by 
a DegenSAC [26] procedure. 

3.4-3. Local affine frame check 

Since the epipolar geometry constraint is much less restrictive than a ho¬ 
mography, wrong correspondences consistent with some (random) fundamental 
matrix appear. The local affine frame consistency check (LAF-check) eliminates 
virtually all incorrect correspondences. The procedure uses coordinates of the 
closest and furthest ellipse points from the ellipse center of both matched local 
affine frames to check whether the whole local feature is consistent with esti¬ 
mated geometry model (see Figure [3]). The check is performed for the geometric 
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Figure 3: The LAF-check. While centers of both regions A and B are consistent with found 
homography, farthest (1) and closest (2) points of the ellipse pass the check only for region A. 


model obtained by RANSAC. Regions which do not pass the check are discarded 
from the list of inliers. If the number of correspondences after the LAF-check 
is fewer than the user defined minimum, matcher continues with the next step 
of view synthesis. 


4. Implementation and parameter setup 

In this subsection, we discuss the tilt-rotation-detector setups of the MODS 
algorithm, and threshold selection for the first to first geometrically inconsistent 
- FGINN matching strategy validation. 

4-1. View synthesis for different detectors and descriptors 

The two main parameters of the view synthesis, tilt {t} sampling and the 
latitude step A</>base, were explored in the following synthetic experiment. 

A set of simulated views with latitudes angles 6 = (0, 20, 40, 60, 65, 70, 75, 
80, 85°), corresponding to tilt series t = (1.00, 1.06, 1.30, 2.00, 2.36, 2.92, 3.86, 
5.75, 11.470 was generated for each of 150 random images from the Oxford 
Building Dataset^] [33]. Example images are shown in Figure [IJ The ground 
truth affine matrix A was computed for each simulated view using equation 0 
and used in the final verification step. The original image was matched against 
its warped version, and the running time and number of inliers for each combi¬ 
nation of the detector, tilt and rotation (see Table [3]) were computed. In all, 84 
setups for each of the 8 detectors on the 150 image pairs were evaluated. As an 
example, we show the relation between the density of the view-sphere sampling 
and the number of images matched for the DoG detector in Figure [5| 

Since our goal is to find a variety of detector-tilt-rotation configurations op¬ 
erating with different matching ability - run-time trade-offs, we defined “easy”, 


2 assuming that the original image is the fronto-parallel view 
3 available at http://www.robots.ox.ac.uk/~vgg/data/oxbuildings/ 
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Figure 4: Two examples of image sets from the synthetic dataset. Original unwarped images 
are from the Oxford buildings dataset m- 



Figure 5: Top: the percentage of images matched depending the synthetic viewpoint difference 
for the DoG detector with tilt configurations given in Table [3] Left: different tilt synthesis 
configurations (for Acf) = 240°), right: different rotation synthesis configurations (for tilt set 
(d) = {1,5,9}). Bottom: running time per image pair for the respective configuration. 
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Table 3: View synthesis configuration evaluation. Configuration is a triplet - the detector, 
descriptor and the set of the synthesized views. 


Detector-descriptor combination 

DoG-SlFT, Hessian-Affine-SIFT, Harris-Affine-SIFT, 
MSER-SIFT, SURF-SURF, SURF-FREAK, 
AGAST-FREAK, ORB 

Tilt set 

a = {1}, b = {1,2}, c = {1,8}, d = {1,5,9}, e = 

= {1,4,8}, g = {1,2,4,8}, h = (1,3, 6 , 9}, i = 
{1,4,8,10}, j = {1,2,4, 6 , 8}, k = {l,y/ 2 , 2, 2^2, 4, 

4%/2, 8}, 1 = {1,2,3,4,5,6,7,8,9|. 

< fibasel ] 

360, 240, 180, 120, 90, 72, 60 


“medium” and “hard” problems on the synthetic dataset. Successful two-view 
matching was defined as recovering n > 50 ground truth correspondences on a 
synthetically warped image. The threshold is set high - synthetic warping of 
an image is underestimating the reduction of the number of matchable features 
induced by the effects of a corresponding viewpoint change e.g. due to non¬ 
planarity of the scene or illumination changes. The matcher is considered to 
solve an “easy” problem if percentage of the matched images / > 50% of total 
images, “medium” if / > 90% of images matched and solved “hard” if / > 99% 
of the images are matched. 

The experiment with the synthetically warped dataset gives a hint about 
the limits of configurations. Three configurations that solved the maximum tilt 
difference for each case fastest for a given detector were selected for evaluation. 
The configurations are specified in Table [4] 

The average time necessary to match a given synthetic tilt difference for 
different detectors with the optimal configuration is shown in Figure [6] The 
computations were performed on the Intel i7 3.9GHz (8 cores) desktop with 
8 Gb RAM with parallel processing. 

Note that view synthesis significantly increases the matching performance 
of all detectors, but not uniformly. The left plot of Figure [6] shows that a very 
sparse viewsphere sampling greatly improves matching at almost no computa¬ 
tional cost for all detectors. However, after reaching a certain density, additional 
views do not add correspondences in the hardest cases - see the right graph of 
Figure [6] The ORB detector-descriptor clearly outperforms other detectors in 
terms of speed, but fails to match all images with the maximum tilt difference. 
The Hessian-Affine shown the best performance and it matched all pairs. 

4-2. First geometrically inconsistent nearest neighbor ratio correspondence se¬ 
lection strategy 

The following protocol was used to find the thresholds and to evaluate the 
performance of the proposed First Geometrically Inconsistent Nearest Neighbor 
FGINN strategy. First, similarity covariant regions were detected using the 
DoG detector (we also tried Hessian-Affine, MSER and SURF, with very sim¬ 
ilar results) and described using four popular descriptors - RootSIFT, SURF, 
LIOP |35j and MROGH [2D] which are typically matched with the second- 
nearest region SNN strategy. Then for each keypoint descriptor, the first, second 
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Figure 6: Performance of view synthesis configurations on the synthetic dataset. Average 
time needed to match (left) “easy”, (center) “medium” and (right) “hard” problems. The time 
for the fastest detector configuration that solved corresponding problems for a given tilt is 
shown. 


and first geometrically inconsistent descriptors in the other image were found. 
The matching keypoints were then labeled as correct if their Sampson error was 
within 1 pixel of the ground truth location given by homography for the image 
pair, and incorrect otherwise. 

The experiment was performed on 26 image pairs of the publicly avail¬ 
able datasets [9],|34j (image pairs 1-3, precise homography provided) and [36] 
(the homography was estimated using provided precise ground truth correspon¬ 
dences). The recall-precision curves for correspondences from all images were 
plot with a varying ratio threshold from 0 to 1 in Figure [7[ The FGINN curves 
for SIFT and SURF slightly outperform standard SNNs, while for LIOP and 
MROGH the difference is much more significant. The significantly higher ben¬ 
efit of the FGINN rule for LIOP and MROGH can be explained by their lower 
sensitivity to keypoint shift which in turn means that undesirable suppression 
of keypoints happens in a larger neighborhood. The lower sensitivity to shifts 
was experimentally verified. 


5. Experiments 

We have tested MODS and, as a baseline, ASIFT0 and single detector con¬ 
figurations specified in Table |4]on seven public, datasets [TO].[30 ] .[37 ] . [38 . [391, 

m,m- 


Implementation details of the MODS algorithm and parameter setting. The kd- 
tree algorithm from FLANN library [42 was used to efficiently find the N-closest 
descriptors. The distance ratio thresholds of the FGINN matching strategy 
were experimentally selected based on the CDFs of matching and non-matching 
descriptors. 

The MODS algorithm allows to set the minimum desired number of inkers 
which have a very low probability to be a random result as a stopping criterion. 


4 Reference code from http://demo.ipol.im/demo/my_affine_sift 
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Table 4: View synthesis configurations with best synthetic dataset performance. 



Configurations 

Detector 

Easy 

Medium 

Hard 

DoG-SIFT 

(0 = {i; 5; 9}, 

A<j> = 360 °/t 

{*} = 

(1; 2; 3; 4; 5; 6; 7; 8; 9}, 
A<j> = 180°/t 

{t} = (1; 2; 4; 6; 8}, 

A0 = 60°/t 

HarrAff-SIFT 

{*} = {!; 3; 6; 9}, 

A <t> = 360 °/t 

{*} = 

(1; 2; 3; 4; 5; 6; 7; 8; 9}, 
A <)> = 360 °/t 

W = 

(1; 2; 3; 4; 5; 6; 7; 8; 9}, 
A 4> = 120 °/t 

HessAff-SIFT 

{t} = {l;5;9}, 

A <f> = 360 °/t 

|t} = {l;5;9}, 

A <t> = 360 °/t 

(tj = {1; 2; 4; 6; 8}, 

A <t> = 60°/t 

MSER-SIFT 

{t} = {l;8}, 

A4> = 360 °/t 

{*} = 

{1; V2; 2; 2V2; 4; 

4\/2; 8}, A(j> = m°/t 

{t} = (1; 3; 6; 9}, 

A 4> = 60 °/t 

AGAST-FREAK 

{*} = {!; 4; 8; 10}, 

A<t> = 360 °/t 

{«} = {1; 3; 6; 9}, 

A <fi = 60° /t 

{t} = {1; 3; 6; 9}, 

A 4> = 72°/t 

ORB 

{*} = {!; 5; 9}, 

A<f> = 360 °/t 

W = 

(1; 2; 3; 4; 5; 6; 7; 8; 9}, 
A 4> = 90 °/t 

{t} = 

{1; y/2\ 2; 2\/2; 4; 
4V2;8}, A 4> = 72°/t 

SURF-FREAK 

{t} = 

{1; \/2; 2; 2\/2; 4; 

4 A; 8}, A<j> = 60 °/t 

{t} = 

{1; V2; 2; 2V2; 4; 

4\/2; 8}, A 4> = 90° /t 

W = 

{l;V2;2;2\/2;4; 

4\/2; 8}, A <{> = 72°/t 

SURF-SURF 

{t} = {l;5;9}, 

A <j> = 360 °/t 

{t} = 

(1; 2; 3; 4; 5; 6; 7; 8; 9}, 
A <t> = 360 °/t 

W = 

(1; 2; 3; 4; 5; 6; 7; 8; 9}, 
A <t> = 72°/t 


The recommended value - 15 inliers to the homography - did not produce a 
false positive results in experiments. Computations were performed on Intel i3 
CPU @ 2.6GHz with 4Gb RAM with 4 cores. 

5.1. MODS variants testing on Extreme Viewpoint and Oxford Dataset 

To evaluate the performance of matching algorithms, we introduce a two- 
view matching evaluation dataset^] with extreme viewpoint changes, see Table [H] 
The dataset includes image pairs from publicly available datasets: ADAM and 
MAG [16], GRAF [9] and there 34. The ground truth homography matrices 
were estimated by LO-RANSAC using correspondences from all detectors in 
view synthesis configuration {t} = {1; y/2\ 2; 2\/2; 4; 4\/2; 8}, A <p = 72 °/t. The 
number of inliers for each image pair was > 50 and the homographies were 
manually inspected. For the image pairs GRAF and there precise homogra¬ 
phies are provided by Cordes et.al. [ 33 ]. Transition tilts r were computed using 
equation 0 with SVD decomposition of the linearized homography at center 
of the first image of the pair (see Table [5|. Oxford [9] dataset with 42 image 
pairs (1-2, ..., 1-6) was used for easier wide baseline problems. 

Experimental protocol. The evaluated algorithms matched image pairs and the 
output keypoints correspondences were checked with ground truth homogra- 


5 Available at http://cmp.felk.cvut.cz/wbs/index.html 
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Figure 7: Comparison of the FGINN and SNN matching strategies for SIFT, SURF, LIOP 
and MROGH on images from the Oxford |9] and Cordes et.al. m datasets. Markers show 
operationg points for the common distance ratio threshold = 0.8. Matching without (left) 
view synthesis an using view synthesis (right) with parameters {£} = {1; 5; 9}, Ac/) = 360 °/t. 
The recall and precision of the correspondence filtering step, therefore the maximum recall is 
1 when all correspondences are kept. 


phies. The image pair is considered as solved, when at least 10 output corre¬ 
spondences are correct. 

Figure [8] compares the different view synthesis configurations. Note that no 
single detector solved all image pairs. The Hessian-Affine, MSER, Harris-Affine 
and DoG successfully solved resp. 13, 13, 12 and 13 out of the 15 image pairs 
however, at the expense of the high computational cost. We also noticed that 
if one would know the suitable detector and configuration for each image, it is 
possible to match all image pairs. 



Figure 8: Performance of different configurations on the Extreme View Dataset. Cumulative 
percentage of image pairs successfully matched in a given time for configurations defined in 
Table Each mark represents one image pair. The fastest detector configuration which 
was able to match each pair was selected. Left - ’easy’, middle - ’medium’, left - ’hard’ 
configurations. An images is considered matched if 10 correct inliers were found. 


The MODS algorithm with more time-consuming configurations solves all 
image pairs and does it faster than a suitable configuration for each image pair 
- see Figure [9j We have tested several variants of the MODS configurations, 
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Table 5: The Extreme View Dataset - EVD. Image sources: C - |34|. Ox - ^9], M - 16 . 
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Figure 9: Performance of MODS configurations specified in Table [2] Cumulative percentage 
of image pairs successfully matched in time. Left - EVD dataset, right - Oxford dataset. 
The graphs are cropped. Each mark represents one image pair. The fastest single detector 
configuration which is able to match each pair was selected and plotted separately. Image 
considered matched if 10 correct inliers was found. 
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Table 6: ASIFT view synthesis configuration 


Detector 

View synthesis setup 

DoG 

{S} = {!},{*} = {1; x/2; 2; 2\/2; 4; 4\/2; 8}, A<j> = 72°/t 


stated in Table [2} Experiments shows that the proposed MODS configuration 
is very fast on the easy WBS problems as in Oxford dataset (see Figure |9j right 
graph) and has very little overhead on the harder EVD dataset - it is the second 
best after configuration without ORB steps. The results of the MODS medium 
configuration - without first sparse synthesis step - shows fruitfulness of the 
progressive view synthesis. 

ASIFT is able to match only 6 image pairs from the dataset. The ASIFT 
algorithm generates a lower number of correct inliers and works slower than our 
identical DoG configuration (which has the same tilt-rotation set). The main 
causes are the elimination of ”one-to-many”, including correct, correspondences, 
the inferiority of the standard second closest ratio matching strategy and a 
simple brute-force algorithm of matching used in ASIFT. 

Fig. [lO] shows the breakdown of the computational time. The most time 
consuming parts - detection and description (including the dominant orientation 
estimation) - take 40% and 35% resp. of the all time. Without applying the fast 
SIFT computation from [29] , the SIFT description takes more than 50% of the 
time. The ORB is an exception - the synthesis is not so profitable, since it takes 
more time than detection and description itself. Note that the whole process is 
almost linear in the area of the synthesized views. The only super-linear part, 
matching, takes only 10% of the time. 



I Synthesis 
] Detection 
] Orientation 
] Description 
] Matching 
IRANSAC 
|Misc 


40 60 

Time [%] 


100 


Figure 10: Percentages of time spent in the main stages of the matching with view synthe¬ 
sis process on a single core, EASY configuration. Detection and SIFT description, i.e. the 
dominant gradient estimation and the descriptor computation are the most time-consuming 
parts. 
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5.2. MODS testing on a non-planar dataset 
5.2.1. The dataset and evaluation protocol 

The evaluation dataset consists of 35 image sequences taken from the Turntable 
dataset m ( “Bottom” camera) shown in Table [7] Eight image sets contain 
objects with relatively large planar surfaces and the remaining ones are low- 
textured, “general 3D” objects. 

The view marked as “0°” in the Turntable dataset was used as a reference 
view and 0—90° and 270-355 ° views with a 5° step were matched against it using 
the procedure described in Sec. [5j forming a [—90°,90°] sequence. Note that 
the reference view is not usually the “frontal” or “side” view, but rather some 
intermediate view which caused asymmetry in results (see Fig. 11, Table [8]). 


The output of the matchers is a set of the correspondences and the estimated 
geometrical transformation. The accuracy of the matched correspondences was 
chosen as the performance criterion, similarly to the protocol in [43]. For all out¬ 
put correspondences, the symmetrical epipolar error [27] esymEG was computed 
according to the following expression: 


e SymEG (F, •>, V> = (v T Fu ) 2 X ( (pu)f j (Pu) 2 + (F T v) 2 [ (JJ T v) 2 ) • P) 

where F - fundamental matrix, u, v - corresponding points, (Fu)| - the square 
of the j- th entry of the vector Fu. 

The ground truth fundamental matrix was obtained from the difference in 
camera positions [27], assuming that turntable is fixed and the camera moved 
around the object, according to the following equation: 


F = K~ T RK T [KR T t ] x , 

/ cos <j> 0 -sine/) \ / 0 f 

R = o 1 0 h* = 0 & f 

\ sine/) 0 cos cf) J \ 0 0 1 


( sin (j) 

0 

1 — cos cj> 


(3) 


where R is the orientation matrix of the second camera, K - the camera pro¬ 
jection matrix, t - the virtual translation of the second camera, r - the distance 
from camera to the object, (j) - the viewpoint angle difference, FRx, FRy - the 
focal plane resolution, / - the focal length, m, n - the sensor matrix width and 
height in pixels. The last five parameters were obtained from EXIF data. 

One of the evaluation problems is that background regions, i.e. regions that 
are not on the object placed on the turntable, are often detected and matched 
influencing the geometry transformation estimation. The matches are correct, 
but consistent with an identity transform of the (background of) the test images, 
not the fundamental matrix associated with the movement of the object on the 
turntable. In order to solve this problem, the median value of the correspondence 
errors was chosen as the measure of precision because of its tolerance to the low 
number of outliers (e.g. the above-mentioned background correspondences), and 
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Table 7: Reference views of the image sequences used in the evaluation (from |10|h 
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Table 8: Experiment on the 3D dataset. The configurations are defined in Table [4] The 


number of correctly matched image pairs and the run-time per pair. Best results are boxed 
Results within 90% of the best are in bold. The configurations are sorted by average time 
for image pair. 


Image sets solved (out of 35) 


Matcher 


0 ° 


Viewpoint angular difference 
10° 15° 20° 25° 30° 35° 


40° > 45° 


Time 

[si 


HessianAffine easy 


35 32 29 26 


22 


17 13 14 


9 6 


0.9 


HessianAffine medium 

35 

30 

26 

24 

18 

17 

12 

10 

7 

10 

1.0 

ORB easy 

35 

31 

27 

20 

16 

11 

10 

5 

5 

4 

1.0 

AGAST-FREAK easy 

35 

28 

29 

26 

19 

17 

12 

9 

7 

6 

1.3 

DoG easy 

35 

33 


30 

22 

18 

12 

10 

7 

5 

3 

1.4 

MSER easy 

35 

28 

22 

21 

15 

13 

10 

8 

7 

5 

1.5 

SURF-SURF hard 

35 

26 

16 

11 

9 

7 

6 

4 

3 

3 

2.1 

HarrisAffine easy 

35 

29 

22 

12 

8 

6 

5 

3 

2 

1 

2.7 

AGAST-FREAK hard 

35 

31 

28 

25 

20 

17 

15 

10 

8 

5 

4.7 

HarrisAffine medium 

35 

27 

20 

11 

7 

6 

5 

4 

3 

2 

4.7 

HessianAffine hard 

35 

29 

24 

20 

20 

17 

15 

11 

9 

6 

5.6 

DoG medium 

35 

32 

26 

21 

16 

11 

10 

9 

6 

5 

6.8 

AGAST-FREAK medium 

35 

29 

23 

22 

18 

14 

13 

6 

6 

6 

7.2 

ORB hard 

35 

31 

24 

19 

16 

7 

5 

4 

1 

4 

7.3 

SURF-FREAK easy 

35 

25 

21 

13 

10 

8 

6 

3 

2 

2 

7.3 

ORB medium 

35 

30 

26 

18 

17 

10 

5 

4 

2 

4 

8.3 

SURF-FREAK medium 

35 

22 

15 

10 

9 

7 

6 

4 

1 

2 

9.1 

SURF-SURF medium 

35 

25 

20 

13 

11 

7 

3 

3 

2 

2 

10.7 

SURF-FREAK hard 

35 

25 

17 

14 

10 

9 

4 

3 

1 

2 

10.9 

MSER hard 

35 

29 

24 

24 

19 

16 

12 

13 

8 

8 

14.5 

HarrisAffine hard 

35 

27 

16 

8 

5 

5 

5 

3 

2 

2 

15.2 

DoG hard 

35 

32 

24 

17 

16 

13 

12 

8 

7 

8 

16.7 

MSER medium 

35 

29 

26 

23 

21 

14 

10 

9 

7 

9 

17.5 


ASIFT 


35 32 24 18 13 


27.6 


MODS:l-7 


35 31 27 


27 22 


22 


14 


10 


11 


6.7 


its sensitivity to the incorrect geometric model estimated by RANSAC. 

An image pair is considered as correctly matched if the median symmetrical 
epipolar error on the correspondences using ground truth fundamental matrix 
is < 6 pixels. 

5.2.2. Results 

Figure [Tl] and Table [8] show the percentage and the number of image se¬ 
quences respectively for which the reference and tested views for the given 
viewing angle difference were matched correctly. 

The difference between EASY, medium and hard configurations is small for 
structured scenes — unlike planar ones. Difficulties in matching are caused 
not by the inability to detect distorted regions but by object self-occlusions. 
Therefore synthesis of the additional views does not bring more correspondences. 

Experiments with view synthesis confirmed m results that the Hessian- 
Affine outperforms other detectors for matching of structured scenes and can 
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Figure 11: A comparison of view synthesis co^gurations on the Turntable dataset |10|1 . The 
fraction of correctly matched images for a given viewpoint difference. 





























Figure 12: Correspondences found by MODS. Green - corresponding regions, cyan - epipolar 
lines. Objects with significant self-occlusion and mostly homogenious texture were selected. 
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Figure 13: Examples from the Extreme Zoom Dataset 


be used alone in such scenes. MODS shows similar performance, but is slower 
than the Hessian-Affine configuration. 

The computations were performed on Intel i5 3.0GHz (4 cores) desktop with 
16Gb RAM. Examples of the matched images are shown in Fig. [12] 

5.3. MODS testing on other datasets 
5.3.1. Extreme zoom dataset 

We introduce the Extreme Zoom dataset (EZD), which is a small subset of 
the retrieval dataset used in [41]. It consists of six sets of images with an in¬ 
creasing level of zoom (see examples in Figure [l3|). The state-of-the art matcher 
- ASIFT [16j and registration algorithm DualBootstrap [44] as well a results for 
MSER, ORB and Hessian-Affine matchers without view synthesis were com¬ 
pared to MODS. Image pairs are matched with tested algorithms. An image 
pair is considered solved when at least 10 output correspondences are correct. 
Results are shown in Table [9) 


Table 9: Results on the Extreme Zoom Dataset. 


Matcher 

I (max 

Zoom level, matched 
6) II (max 6) III (max 6) 

IV (max 4) 

ORB 

0 

0 

0 

0 

MSER 

2 

2 

1 

0 

DBstrap 

3 

1 

0 

0 

HessAff 

5 

3 

2 

0 

ASIFT 

3 

3 

2 

0 

MODS 

6 

3 

3 

0 


5.3.2. Ultra wide-baseline dataset 

The MODS performance was evaluated on the city-from-air dataset from 
m- The dataset comprises 30 pairs (examples shown in Figure [14]) of pho¬ 
tographs of buildings taken from the air. The view points difference is quite 
large, the images contain repeated structures, illumination differs. The authors 
proposed a matcher based on HoG m descriptor with view synthesis and com¬ 
pare it to ASIFT and D-Nets [46] for skyscraper frontal face matching. We 
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follow the evaluation protocol which considers a pair matched correct only if 
the facade plane is matched (> 75% correct inliers). If the output homography 


was ground/roof, it is considered incorrect. The results are shown in Table 10 


Note that no special adjustment is done in MODS for homography selection, 
so the reported performance is a lower bound. 



Figure 14: Examples of image pairs from the Ultra wide-baseline dataset E3 


Table 10: Results on Ultra wide-baseline dataset (all results except MODS taken from EH) 


Method 

Correct (or 

shifted) Different plane Failure/ground plane 

Altwaijry and Belongie [37] 

9 

1 

20 

ASIFT-homography 

1 

5 

24 

D-Nets 

7 

2 

21 

MODS 

8 

1 

21 


5.3.3. Datasets with other than geometric changes 

Despite being designed for (extreme) wide baseline stereo problems, MODS 
performance was evaluated on other datasets: GDB-ICP [39] (modality, view¬ 
point and photometry changes), SymBench [38; (photometrical changes and 


photo-vs-painting pairs), and MMS [40] (infrared-vs-visible pairs) - see Table 11 


The state-of-the art matcher - ASIFT [I6j and registration algorithm DualBoot- 
strap [44j as well a results for MSER, ORB and Hessian-Affine matchers without 
view synthesis were compared to MODS. 


_ Table 11: Evaluation Datasets _ 

Short name Proposed by ^images Nuisance 

GDB-ICP Kelman et.al. [39 1. 2007 22 pairs Illumination, modality 

SymBench Hauagge and Snavely m, 2012 46 pairs Illumination, modality 

MMS Aguilera et.al. |40| . 2012 100 pairs Modality 

EVD Mishkin et.al. [30| . 2013 15 pairs Viewpoint change 

Image pairs are matched with the tested algorithms. Output keypoints cor¬ 
respondences were checked against the ground truth homographies. An image 
pair is considered solved when at least 10 output correspondences are correct 
Our primary evaluation criterion is the ability to find sufficiently correct geo¬ 
metric transformations in a reasonable time; accurate geometry can be found in 
consecutive step. 
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The computations were performed on Intel i3 3.0GHz desktop (4 cores) with 
4Gb RAM. 

Results are shown in the Table fl2l MODS is the fastest method and it is able 
to match the most image pairs in GDB-ICP and SymBench datasets without 
using symmetrical parts or other problem-specific features. Images from the 
MMS dataset (as well as other thermal images) produce a small number of 
features as they do not contain many textured surfaces, and have very short 
geometrical baseline. Those are the main reasons why area-based method - 
Dual-Bootstrap - work significantly better than the feature-based methods. 

After lowering the threshold for detectors allowing to detect more feature 
points and using the orientation restricted SIFT j47] in addition to the Root- 
SIFT, MODS-IR solved 83 out of 100 image pairs from MMS dataset. 


Table 12: Performance on non-WBS datasets. For comparison, results of MSER, ORB and 
Hessian-Affine matchers without view synthesis are added. Best results are in bold, average 
time per image pair is shown. 


Matcher 

GDB-ICP 

SymBench 

MMS 

EVD 


pairs 

time 

pairs 

time 

pairs time 

pairs 

time 


solved 

M 

solved 

M 

solved 

M 

solved 

M 

ASIFT 

15/22 

41.5 

27/46 

14.7 

8/100 

3.2 

5/15 

12.4 

MODS:l-7 

17/22 

2.8 

38/46 

3.7 

12/100 

2.0 

15/15 

2.4 

DBstrap 

16/22 

17.6 

38/46 

21.7 

79/100 

9.3 

0/15 

1.9 

ORB 

0/22 

0.2 

0/46 

0.4 

0/100 

0.1 

0/15 

0.3 

MSER 

8/22 

1.7 

21/46 

0.8 

0/100 

0.2 

4/15 

0.6 

HessAff 

11/22 

1.9 

29/46 

1.5 

2/100 

0.4 

2/15 

1.1 

MODS-IR:l-7 

21/22 

7.6 

39/46 

15.1 

83/100 

9.1 

14/15 

8.6 


6. Conclusions 

An algorithm for two-view matching called Matching On Demand with view 
Synthesis algorithm (MODS) was introduced. The most important contribu¬ 
tions of the algorithm are its ability to adjust its complexity to the problem 
at hand, and its robustness, i.e. the ability to solve a broader range of wide- 
baseline problems than the state of the art. This is achieved while being fast 
on simple problems. 

The apparent robustness vs. speed trade-off is finessed by the use of progres¬ 
sively more time-consuming feature detectors, and by on-demand generation of 
synthesized images that is performed until a reliable estimate of geometry is 
obtained. The MODS method demonstrates that the answer to the question 
”which detector is the best?” depends on the problem at hand, and that it is 
fruitful to focus on the ”how to combine detectors” problem. 

We are the first to propose view synthesis for two-view wide-baseline match¬ 
ing with affine-covariant detectors, which is superficially counter-intuitive, and 
we show that matching with the Hessian-Affine or MSER detectors outperforms 
the state-of-the-art ASIFT. View synthesis performs well when used with simple 
and very fast detectors like ORB, which obtains results similar to ASIFT but 
in orders of magnitude shorter time. 
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Minor contributions include an improved method for tentative correspon¬ 
dence selection, applicable both with and without view synthesis and a modifi¬ 
cation of the standard first to second nearest distance rule increases the number 
of correct matches by 5-20% at no additional computational cost. 

The evaluation of the MODS algorithm was carried out both on standard 
publicly available datasets as well as a new set of geometrically challenging wide 
baseline problems that we collected and will make public. The experiments show 
that the MODS algorithm solves matching problems beyond the state-of-the- 
art and yet is comparable in speed to standard wide-baseline matchers on easy 
problems. Moreover, MODS performs well on other classes of difficult two-view 
problems like matching of images from different modalities, with large difference 
of acquisition times or with significant lighting changes. 
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